Applying WebTables in Practice

نویسندگان

  • Sreeram Balakrishnan
  • Alon Y. Halevy
  • Boulos Harb
  • Hongrae Lee
  • Jayant Madhavan
  • Afshin Rostamizadeh
  • Warren Shen
  • Kenneth Wilder
  • Fei Wu
  • Cong Yu
چکیده

We started investigating the collection of HTML tables on the Web and developed the WebTables system a few years ago [4]. Since then, our work has been motivated by applying WebTables in a broad set of applications at Google, resulting in several product launches. In this paper, we describe the challenges faced, lessons learned, and new insights that we gained from our efforts. The main challenges we faced in our efforts were (1) identifying tables that are likely to contain high-quality data (as opposed to tables used for navigation, layout, or formatting), and (2) recovering the semantics of these tables or signals that hint at their semantics. The result is a semantically enriched table corpus that we used to develop several services. First, we created a search engine for structured data whose index includes over a hundred million HTML tables. Second, we enabled users of Google Docs (through its Research Panel) to find relevant data tables and to insert such data into their documents as needed. Most recently, we brought WebTables to a much broader audience by using the table corpus to provide richer tabular snippets for fact-seeking web search queries on Google.com.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evidence based medicine in nuclear medicine practice; Part II: Appraising and applying the evidence

  As described in the first part of this article, Evidence Based Medicine (EBM) is a growing part of medical practice which emphasizes on the best evidence. Finding this evidence by formulating an answerable question and searching strategies were described in the first part of this review. In this part, appraising the retrieved article (with the main focus on the diag...

متن کامل

QUICK: Expressive and Flexible Search over Knowledge Bases and Text Collections

Recent work on Web-extracted data sets has produced an interesting new source of structured Web data. These data sets can be viewed as knowledge bases (KB) – large heterogeneous linked entity collections with millions of unique edge and node labels, often encoding rich semantic information over entities. For example, YAGO [5] and ExDB [2] have fact collections numbering in the tens and hundreds...

متن کامل

Schema Extraction for Tabular Data on the Web

Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter’s interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables — attribute names, values, data types, etc. — are not explicitly stored as table metadata. Consequently, the structure that these tables contain is...

متن کامل

Uncovering the Relational Web

The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these stru...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015